Add a new dataset Mercury #238
Conversation
@SivilTaram FYI
Hi, thank you very much for submitting this interesting benchmark! LGTM, just two comments:
- Did you make sure the current implementation matches the scores reported in your paper for one of the public LLMs?
- Can you add some documentation on how to use the benchmark to the docs (https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs)?
@loubnabnl Thank you so much for reviewing this code :)
Yes. The scores reported in our paper are based on this implementation. We are also working on publishing a public leaderboard page.
Sure. The instructions have been added. See this commit.
Thanks! Ready to merge 🚀
Add a new dataset Mercury
Motivation first:
TL;DR: Mercury is a dataset for evaluating the computational efficiency of Python code generation.
Amidst the recent strides in evaluating Large Language Models for Code (Code-LLMs), existing benchmarks have mainly focused on functional correctness, overlooking the importance of computational efficiency.
To fill this gap, we present Mercury, the first computational efficiency benchmark for Code-LLMs. It comprises 1,889 Python tasks, each with sufficient solutions to support a runtime distribution. Based on this distribution, we introduce a new metric, Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and computational efficiency simultaneously.
On Mercury, leading Code-LLMs achieve 65% on Pass but less than 50% on Beyond. Given that an ideal Beyond score would be aligned with the Pass score, this indicates that while Code-LLMs exhibit impressive capabilities in generating functionally correct code, there remains a notable gap in their computational efficiency. Finally, our empirical experiments reveal that Direct Preference Optimization (DPO) serves as a robust baseline for enhancing computational efficiency compared with Supervised Fine-Tuning (SFT), which paves a promising avenue for future exploration of efficient code generation. Our code and data are available on GitHub: https://github.com/Elfsong/Mercury
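For intuition, here is a minimal, illustrative sketch of a runtime-percentile-weighted score for a single generation. It is a simplification rather than the exact Beyond formula from the paper or the task implementation, and the function name and arguments are hypothetical:

```python
def beyond_score(passed: bool, runtime: float, historical_runtimes: list[float]) -> float:
    """Illustrative runtime-percentile-weighted score for one generated solution.

    A failing generation scores 0; a passing one is credited with the fraction
    of the task's historical runtime distribution it is at least as fast as.
    (The exact Beyond definition is given in the paper and the task code.)
    """
    if not passed or not historical_runtimes:
        return 0.0
    slower_or_equal = sum(1 for t in historical_runtimes if t >= runtime)
    return slower_or_equal / len(historical_runtimes)


# Example: a correct solution that is faster than 4 of the 5 historical solutions.
print(beyond_score(True, 0.12, [0.10, 0.15, 0.20, 0.25, 0.30]))  # 0.8
```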
Write a full paragraph describing the feature;
In this work, we introduce Mercury, a novel code generation benchmark designed to assess and improve the computational efficiency of Code-LLMs. It comprises 1,889 Python programming tasks stratified into three difficulty levels and divided into two splits for model evaluation and fine-tuning, respectively. For each evaluation task, we provide a test-case generator to remedy the shortfall in test-case coverage. In measuring computational efficiency, the primary challenge stems from normalizing absolute runtimes across tasks with diverse runtime ranges. We therefore collect and locally execute numerous historical solutions for each task to form a runtime distribution, and use the runtime percentile of LLM-generated code on that distribution, rather than its absolute runtime, to evaluate computational efficiency. Furthermore, to mitigate performance discrepancies caused by irrelevant processes and diverse hardware configurations, we set up an isolated sandbox environment for task execution when establishing the local runtime distributions. More details can be found in the paper: https://arxiv.org/abs/2402.07844
Provide a code snippet that demonstrates its future use;
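Below is a minimal sketch of how the benchmark could be run with the evaluation harness, mirroring the usage documented in the docs added in this PR. The model name, the task name `mercury`, and the flag values are illustrative placeholders; please refer to the docs for the exact command.

```bash
# Hypothetical example run; model, task name, and flag values are placeholders.
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mercury \
  --n_samples 5 \
  --temperature 0.2 \
  --batch_size 5 \
  --allow_code_execution
```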
In case this is related to a paper, please attach a link;
More details can be found in the paper: https://arxiv.org/abs/2402.07844